Vectorize Enumerable.Range(...).ToArray()#80633
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-linq Issue Detailsnull
|
|
Numbers (there is sadly a lot of variability on my system so please consider median too): (data is for direct |
|
Thanks, @neon-sunset, for working on this. I've avoided doing this kind of vectorization in the past because I couldn't find any meaningful use outside of tests, and the impact on maintainability / size wasn't worth it. Can you share any information on where you see this optimization benefiting consumers? Any examples you can point to? Thanks! |
Sorry, I cannot. The initial implementation was done simply because .NET numbers weren't competitive with Rust or C++ and it was fun to iterate with BDN assembly output to achieve parity. As far as usages in Also, List initialization can still be improved even without vectorization through providing |
|
Yes. Now which of those aren't:
? |
|
Pushed a fix for impl. and feedback. Will close the PR since a change is deemed inappropriate for Will also open a separate issue because materialization to |
To be clear, I'd be happy if we could demonstrate this was valuable. I just haven't seen enough evidence of that yet. |
There was a problem hiding this comment.
You shouldn't need unsafe code here to avoid bounds checks.
int cur = _start;
for (int i = 0; i < destination.Length; i++)
{
destination[i] = cur;
cur++;
}There was a problem hiding this comment.
This won't be inlined. I'd expect this to regress for very short ranges due to the extra method call that wasn't there before.
* Remove `RangeIterator.ToArray` bounds check #80633 (comment) * Eliminate additional `ToArray` bounds checks
* Remove `RangeIterator.ToArray` bounds check dotnet#80633 (comment) * Eliminate additional `ToArray` bounds checks
|
@neon-sunset Could you report new BenchmarkDotNet results after #81001? |
|
Hi @neon-sunset. Thanks for getting this contribution going. There are a couple of feedback comments that still need to be addressed, and there's a conflict that has emerged too. I'm going to mark this as |
|
Hi @jeffhandley Thanks, I will address feedback later this week. |
296f2bc to
4ca8590
Compare
|
Failures seem to be unrelated, I think? Below numbers are for direct BenchmarkDotNet=v0.13.3, OS=macOS 13.2.1 (22D68) [Darwin 22.3.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23121.17
[Host] : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
NativeAOT 8.0 : .NET 8.0.0-preview.2.23116.1, Arm64 NativeAOT AdvSIMD
| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|-------------- |-------------- |-------------- |-------- |--------------:|------------:|------------:|------:|
| ToArray | DefaultJob | .NET 8.0 | 1 | 3.025 ns | 0.0167 ns | 0.0156 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 1 | 3.066 ns | 0.0082 ns | 0.0069 ns | 1.01 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 1 | 3.294 ns | 0.0119 ns | 0.0111 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 1 | 3.504 ns | 0.0036 ns | 0.0028 ns | 1.07 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 5 | 5.767 ns | 0.0210 ns | 0.0196 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 5 | 5.991 ns | 0.0241 ns | 0.0226 ns | 1.04 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 5 | 5.707 ns | 0.0130 ns | 0.0121 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 5 | 5.475 ns | 0.0175 ns | 0.0155 ns | 0.96 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 10 | 7.837 ns | 0.0246 ns | 0.0230 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 10 | 5.552 ns | 0.0199 ns | 0.0176 ns | 0.71 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 9.208 ns | 0.0270 ns | 0.0253 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 5.989 ns | 0.0116 ns | 0.0108 ns | 0.65 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 100 | 50.587 ns | 0.0668 ns | 0.0592 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 100 | 23.730 ns | 0.0718 ns | 0.0600 ns | 0.47 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 45.158 ns | 0.1053 ns | 0.0985 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 21.474 ns | 0.0487 ns | 0.0407 ns | 0.48 |
| | | | | | | | |
| ToArray | DefaultJob | .NET 8.0 | 100000 | 46,401.337 ns | 636.9110 ns | 595.7669 ns | 1.00 |
| ToArrayPR | DefaultJob | .NET 8.0 | 100000 | 23,084.711 ns | 350.5279 ns | 327.8841 ns | 0.50 |
| | | | | | | | |
| ToArray | NativeAOT 8.0 | NativeAOT 8.0 | 100000 | 63,969.073 ns | 182.6120 ns | 170.8154 ns | 1.00 |
| ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 | 100000 | 40,710.868 ns | 148.9555 ns | 124.3846 ns | 0.64 | |
I'm surprised this is measurable, since runtime/src/coreclr/System.Private.CoreLib/src/System/GC.CoreCLR.cs Lines 658 to 664 in 7829252 |
So the best case ratio is 0.47, presumably with much larger code size. |
|
@krwq What do you think is the right path forward for this PR? |
|
@jeffhandley I will take another go at the change to make it more similar to how |
|
@neon-sunset Might be worth a read: #84115. |
|
This pull request has been automatically marked |
|
This pull request will now be closed since it had been marked |
Loop body is almost the same to what LLVM produces for Rust's
let vec: Vec<i32> = (0..count).collect()withx86-64-v3target except the unroll factor it chooses is 4 while here we pick 2 instead because it saturates store throughput on Zen 3 anyways and has much smaller codegen size.NB: on ARM64 it emits a pair of
strs vs LLVM's singlestpbut I'm not sure if that's an optimization .NET can do today out of two adjacentUnsafe.WriteUnaligned.Codegen (.NET 8 daily build + FullPGO):
This is a port of https://github.com/neon-sunset/RangeExtensions/blob/main/RangeExtensions/RangeEnumerable.ICollection.cs#L144 except it doesn't need to support bi-directional scenario.